{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Benchmarks for String Similarity Scoring Functions"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Install the most commonly used Python packages for string similarity scoring. This includes JellyFish for Levenshtein and Levenshten-Damerau distance, RapidFuzz for Levenshtein distance, and BioPython for Needleman-Wunsh scores among others."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install stringzilla # https://github.com/ashvardanian/stringzilla\n",
    "\n",
    "# For Levenshtein distance:\n",
    "!pip install rapidfuzz  # https://github.com/rapidfuzz/RapidFuzz\n",
    "!pip install python-Levenshtein  # https://github.com/maxbachmann/python-Levenshtein\n",
    "!pip install levenshtein # https://github.com/maxbachmann/Levenshtein\n",
    "!pip install jellyfish # https://github.com/jamesturk/jellyfish/\n",
    "!pip install editdistance # https://github.com/roy-ht/editdistance\n",
    "!pip install distance # https://github.com/doukremt/distance\n",
    "!pip install polyleven # https://github.com/fujimotos/polyleven\n",
    "!pip install edlib # https://github.com/Martinsos/edlib\n",
    "!pip install nltk # https://github.com/nltk/nltk\n",
    "\n",
    "# For Needleman-Wunsch and Smith-Waterman algorithms with custom scoring matrices:\n",
    "!pip install biopython # https://github.com/biopython/biopython\n",
    "\n",
    "# For data manipulation:\n",
    "!pip install pandas\n",
    "!pip install tabulate"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's import all those libraries, and check basic functinality of a couple of simple English examples, as well as Unicode, outputting the results in a table."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import jellyfish as jf\n",
    "import Levenshtein as le\n",
    "import editdistance as ed\n",
    "from rapidfuzz.distance import Levenshtein as rf\n",
    "from nltk.metrics.distance import edit_distance as nltk_ed\n",
    "import edlib\n",
    "import stringzilla as sz\n",
    "\n",
    "import pandas as pd\n",
    "from tabulate import tabulate\n",
    "\n",
    "# Define the examples\n",
    "examples = [\n",
    "    ('apple', 'aple'),\n",
    "    ('αβγδ', 'αγδ'),\n",
    "    # ('مرحبا بالعالم', 'مرحبا يا عالم'), # \"Hello World\" vs \"Welcome to the World\" ?\n",
    "    ('école', 'école'),  # etter \"é\" as a single character vs \"e\" + \"´\"\n",
    "    ('Schön', 'Scho\\u0308n'),  # \"ö\" represented as \"o\" + \"¨\"\n",
    "    ('💖', '💗'),  # 4-byte emojis: Different hearts\n",
    "    ('𠜎 𠜱 𠝹 𠱓', '𠜎𠜱𠝹𠱓'),  # Ancient Chinese characters, no spaces vs spaces\n",
    "    ('München', 'Muenchen'),  # German name with umlaut vs. its transcription\n",
    "    ('façade', 'facade'),  # \"ç\" represented as \"c\" with cedilla vs. plain \"c\"\n",
    "    ('こんにちは世界', 'こんばんは世界'),  # Japanese: \"Good morning world\" vs \"Good evening world\"\n",
    "    ('👩‍👩‍👧‍👦', '👨‍👩‍👧‍👦'),  # Family emojis with different compositions\n",
    "    ('Data科学123', 'Data科學321'),\n",
    "    ('🙂🌍🚀', '🙂🌎✨'),\n",
    "]\n",
    "\n",
    "results = []\n",
    "for example in examples:\n",
    "    example_str = example[0] + ' vs ' + example[1]\n",
    "    jellyfish_distance = jf.levenshtein_distance(example[0], example[1])\n",
    "    levenshtein_distance = le.distance(example[0], example[1])\n",
    "    rapidfuzz_distance = rf.distance(example[0], example[1])\n",
    "    editdistance_distance = ed.eval(example[0], example[1])\n",
    "    nltk_distance = nltk_ed(example[0], example[1])\n",
    "    stringzilla_bytes = sz.edit_distance(example[0], example[1])\n",
    "    stringzilla_chars = sz.edit_distance_unicode(example[0], example[1])\n",
    "    results.append({\n",
    "        'Example': example_str,\n",
    "        'Jellyfish': jellyfish_distance,\n",
    "        'Levenshtein': levenshtein_distance,\n",
    "        'RapidFuzz': rapidfuzz_distance,\n",
    "        'EditDistance': editdistance_distance,\n",
    "        'NLTK': nltk_distance,\n",
    "        'StringZilla (Unicode)': stringzilla_chars,\n",
    "        'StringZilla': stringzilla_bytes,\n",
    "    })\n",
    "\n",
    "# Convert results to a DataFrame for easy manipulation\n",
    "df = pd.DataFrame(results)\n",
    "\n",
    "# Use tabulate to print the table, setting tablefmt to \"grid\" for a nice grid-like table format\n",
    "print(tabulate(df, headers='keys', tablefmt=\"grid\"))\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Levenshtein Distance Between Short English Words"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will be conducting benchmarks on a real-world dataset of English words. Let's download the dataset and load it into memory."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!wget --no-clobber -O ../leipzig1M.txt https://introcs.cs.princeton.edu/python/42sort/leipzig1m.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "words = open(\"../xlsum.csv\", \"r\").read(1024 * 1024 * 1024).split()\n",
    "words = tuple(words)\n",
    "print(f\"{len(words):,} words\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import random\n",
    "\n",
    "def checksum_distances(tokens, distance_function, n: int = 1000000):\n",
    "    distances_sum = 0\n",
    "    while n:\n",
    "        a = random.choice(tokens)\n",
    "        b = random.choice(tokens)\n",
    "        distances_sum += distance_function(a, b)\n",
    "        n -= 1\n",
    "    return distances_sum"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%timeit\n",
    "checksum_distances(words, sz.edit_distance)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%timeit\n",
    "checksum_distances(words, jf.levenshtein_distance)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%timeit\n",
    "checksum_distances(words, nltk_ed)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%timeit\n",
    "checksum_distances(words, ed.eval)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%timeit\n",
    "checksum_distances(words, rf.distance)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%timeit\n",
    "checksum_distances(words, le.distance)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%timeit\n",
    "checksum_distances(words, lambda a, b: edlib.align(a, b, mode=\"NW\", task=\"distance\")[\"editDistance\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Levenshtein Distances for Longer Proteins"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import random\n",
    "proteins = [''.join(random.choices('ACGT', k=10_000)) for _ in range(1_000)]\n",
    "print(f\"{len(proteins):,} proteins\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%timeit\n",
    "checksum_distances(proteins, sz.edit_distance, 100)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%timeit\n",
    "checksum_distances(proteins, jf.levenshtein_distance, 100)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%timeit\n",
    "checksum_distances(proteins, ed.eval, 100)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%timeit\n",
    "checksum_distances(proteins, rf.distance, 100)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%timeit\n",
    "checksum_distances(proteins, le.distance, 100)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Needleman-Wunsch Alignment Scores Between Random Protein Sequences"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For Needleman-Wunsh, let's generate some random protein sequences:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from Bio import Align\n",
    "from Bio.Align import substitution_matrices\n",
    "aligner = Align.PairwiseAligner()\n",
    "aligner.substitution_matrix = substitution_matrices.load(\"BLOSUM62\")\n",
    "aligner.open_gap_score = 1\n",
    "aligner.extend_gap_score = 1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "aligner.substitution_matrix"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's convert the BLOSUM matrix into a dense form with 256x256 elements. This will allow us to use the matrix with the Needleman-Wunsh algorithm implemented in StringZilla."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "subs_packed = np.array(aligner.substitution_matrix).astype(np.int8)\n",
    "subs_reconstructed = np.zeros((256, 256), dtype=np.int8)\n",
    "\n",
    "# Initialize all banned characters to a the largest possible penalty\n",
    "subs_reconstructed.fill(127)\n",
    "for packed_row, packed_row_aminoacid in enumerate(aligner.substitution_matrix.alphabet):\n",
    "    for packed_column, packed_column_aminoacid in enumerate(aligner.substitution_matrix.alphabet):\n",
    "        reconstructed_row = ord(packed_row_aminoacid)\n",
    "        reconstructed_column = ord(packed_column_aminoacid)\n",
    "        subs_reconstructed[reconstructed_row, reconstructed_column] = subs_packed[packed_row, packed_column]\n",
    "\n",
    "(subs_reconstructed < 127).sum()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "aligner.score(proteins[0], proteins[1])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sz.alignment_score(proteins[0], proteins[1], substitution_matrix=subs_reconstructed, gap_score=1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%timeit\n",
    "def sz_score(a, b): return sz.alignment_score(a, b, substitution_matrix=subs_reconstructed, gap_score=1)\n",
    "checksum_distances(proteins, sz_score, 100)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%timeit\n",
    "checksum_distances(proteins, aligner.score, 100)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "base",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
